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Abstract 

The  computational  demands  of  software  continue  to  outpace  the  capacities  of  processor  and 
memory  technologies,  especially  in  scientific  and  engineering  programs.  One  option  to  improve 
performance  is  parallel  processing.  However,  despite  decades  of  research  and  development, 
writing  parallel  programs  continues  to  be  difficult.  This  is  especially  the  case  for  scientists  and 
engineers  who  have  limited  backgrounds  in  computer  science.  MATLAB®,  due  to  its  ease  of  use 
compared  to  other  programming  languages  like  C  and  Fortran,  is  one  of  the  most  popular 
languages  for  implementing  numerical  computations,  thus  making  it  an  excellent  platform  for 
developing  an  accessible  parallel  computing  framework. 

The  MIT  Lincoln  Laboratory  has  developed  two  libraries,  pMatlab  and  MatlabMPI,  that  not  only 
enables  parallel  programming  with  MATLAB  in  a  simple  fashion,  accessible  to  non-computer 
scientists.  This  document  will  overview  basic  concepts  in  parallel  programming  and  introduce 
pMatlab. 

1.  Introduction 

Even  as  processor  speeds  and  memory  capacities  continue  to  rise,  they  never  seem  to  be  able  to 
satisfy  the  increasing  demands  of  software;  scientists  and  engineers  routinely  push  the  limits  of 
computing  systems  to  tackle  larger,  more  detailed  problems.  In  fact,  they  often  must  scale  back 
the  size  and  accuracy  of  their  simulations  so  that  they  may  complete  in  a  reasonable  amount  of 
time  of  fit  into  memory.  The  amount  of  processing  power  directly  impacts  the  scale  of  the 
problem  being  addressed. 

One  option  for  improving  performance  is  parallel  processing.  There  have  been  decades  of 
research  and  development  in  hardware  and  software  technologies  to  allow  programs  to  run  on 
multiple  processors,  improving  the  performance  of  computationally  intensive  applications.  Yet 
writing  accurate,  efficient,  high  performance  parallel  code  is  still  highly  non-trivial,  requiring  a 
substantial  amount  of  time  and  energy  to  master  the  discipline  of  high  performance  computing, 
commonly  referred  to  as  HPC.  However,  the  primary  concern  of  most  scientists  and  engineers  is 
conduct  research,  not  to  write  code. 

MATLAB®  is  the  dominant  programming  language  for  implementing  numerical  computations 
and  is  widely  used  for  algorithm  development,  simulation,  data  reduction,  testing  and  evaluation. 
MATLAB  users  achieve  high  productivity  because  one  line  of  MATLAB  code  can  typically 
replace  multiple  lines  of  C  or  Fortran  code.  In  addition,  MATLAB  supports  a  range  of  operating 
systems  and  processor  architectures,  providing  portability  and  flexibility.  Thus  MATLAB 
provides  an  excellent  platform  on  which  to  create  an  accessible  parallel  computing  framework. 

The  MIT  Lincoln  Laboratory  has  been  developing  technology  to  allow  MATLAB  users  to 
benefit  from  the  advantages  of  parallel  programming.  Specifically,  two  MATLAB  libraries  - 
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pMatlab  and  MatlabMPI  -  have  been  created  to  allow  MATLAB  users  to  run  multiple  instances 
of  MATLAB  to  speed  up  their  programs.  This  document  is  targeted  towards  MATLAB  users 
who  are  unfamiliar  with  parallel  programming.  It  provides  a  high-level  introduction  to  basic 
parallel  programming  concepts,  the  use  of  pMatlab  to  parallelize  MATLAB  programs  and  a 
more  detailed  understanding  of  the  strengths  of  global  array  semantics,  the  programming  model 
used  by  pMatlab  to  simplify  parallel  programming. 

The  rest  of  this  paper  is  structured  as  follows.  Section  2  explains  the  motivation  for  developing 
technologies  that  simplify  parallel  programming,  such  as  pMatlab.  Section  3  provides  a  brief 
overview  of  parallel  computer  hardware.  Section  4  discusses  the  difficulties  in  achieving  high 
performance  computation  in  parallel  programs.  Section  5  introduces  the  single-program 
multiple-data  model  for  constructing  parallel  programs.  Section  6  describes  the  message-passing 
programming  model,  the  most  popular  parallel  programming  model  in  use  today,  used  by 
MatlabMPI.  Section  7  describes  a  global  array  semantics,  a  new  type  of  parallel  programming 
model  used  by  pMatlab  and  Section  8  provides  an  introduction  to  the  pMatlab  library  and  how  to 
write  parallel  MATLAB  programs  using  pMatlab.  Section  9  discusses  how  to  measure 
performance  of  parallel  programs  and  Section  10  concludes  with  a  summary. 

2.  Motivation 

As  motivation  for  many  of  the  concepts  and  issues  to  be  presented  in  this  document,  we  begin  by 
posing  the  following  question:  what  makes  parallel  programming  so  difficult?  While  we  are 
familiar  with  serial  programming,  the  transition  to  thinking  in  parallel  can  easily  become 
overwhelming.  Parallel  programming  introduces  new  complications  not  present  in  serial 
programming  such  as: 

•  Keeping  track  of  which  processor  data  is  located  on. 

•  Determining  which  processor  data  needs  to  be  delivered  to  or  received  from. 

•  Distributing  data  such  that  communication  between  processors  is  minimized. 

•  Distributing  computation  such  that  computation  on  each  processor  is  maximized. 

•  Synchronizing  data  between  processors  (if  necessary). 

•  Debugging  and  deciphering  communication  between  processors. 

Consider  transposing  a  matrix.  Let  A  be  a  4x4  matrix.  In  a  serial  program,  the  value  Ajj  is 
swapped  with  the  value  Aj  j.  In  a  parallel  program,  matrix  transpose  is  much  more  complicated. 
Let  A  be  distributed  column-wise  across  four  processors,  such  that  column  n  resides  on 
processor  n-1,  e.g.  Au,  A2ji,  A3J  and  A4,i  are  on  processor  0,  Ai-2,  A2j2,  A3,2  and  A4,2  are  on 
processor  1,  and  so  on.  (Numbering  processors  starting  at  0  is  common  practice.)  To  correctly 
transpose  the  matrix,  data  must  “move”  between  processors.  For  example,  processor  0  sends 
A2,i,  A3J  and  A44  to  processors  1,  2  and  3  and  receives  Ai,2,  Ai-3  and  Ai,4  from  processors  1,  2 
and  3,  respectively.  Figure  1  depicts  a  parallel  transpose. 

Clearly,  the  increase  in  speed  and  memory  associated  with  parallel  computing  requires  a  change 
in  the  programmer’s  way  of  thinking;  she  must  now  keep  track  of  where  data  reside  and  where 
data  needs  to  be  moved  to.  Becoming  a  proficient  parallel  programmer  can  require  years  of 
practice.  The  goal  of  pMatlab  is  to  hide  these  details,  to  provide  an  environment  where 
MATLAB  programmers  can  benefit  from  parallel  processing  without  focusing  on  these  details. 
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Figure  1  -  Example  of  a  parallel  transpose.  Each  column  resides  on  a  different  processor,  represented  by  a 
different  color.  Swapping  a  column  with  a  row  requires  communicating  with  multiple  processors. 


3.  Parallel  Computers 

Any  discussion  on  parallel  programming  is  incomplete  without  a  discussion  of  parallel 
computers.  An  exhaustive  discussion  of  parallel  computer  architecture  is  outside  the  scope  of 
this  paper.  Rather,  this  section  will  provide  a  very  high-level  overview  of  parallel  hardware 
technology. 

There  are  numerous  ways  to  build  a  parallel  computer.  On  one  end  of  the  spectrum  is  the 
multiprocessor,  in  which  multiple  processors  are  connected  via  a  high  performance 
communication  medium  and  are  packaged  into  a  single  computer.  Many  modern 
“supercomputers,”  such  as  those  made  by  Cray  and  IBM,  are  multiprocessors  composed  of 
dozens  to  hundreds  to  even  thousands  of  processors  linked  by  high-performance  interconnects. 
Twice  a  year,  the  TOP500  project  ranks  the  500  most  powerful  computers  in  the  world  [1],  The 
top  ranked  computer  on  the  TOP500  list  as  of  June  2005  is  IBM’s  Blue  Gene/L  at  Lawrence 
Livermore  National  Laboratory  [2],  Blue  Gene/L  had  65,536  processors,  with  plans  to  increase 
that  number  even  further.  Even  many  modern  desktop  machines  are  multiprocessors,  supporting 
two  CPUs. 

On  the  other  end  is  the  Beowulf  cluster  which  uses  commodity  networks  (e.g.  Gigabit  Ethernet) 
to  connect  commodity  computers  (e.g.  Dell)  [3].  The  advantage  of  Beowulf  clusters  is  their 
lower  costs  and  greater  availability  of  components.  Performance  of  Beowulf  clusters  has  begun 
to  rival  that  of  traditional  “supercomputers.”  In  November  2003,  the  TOP500  project  ranked  a 
Beowulf  cluster  composed  of  1100  dual-G5  Apple  XServe’s  at  Virginia  Tech  the  3ld  most 
powerful  computer  in  the  world  [4].  MATLAB  supports  many  different  operating  systems  and 
processor  architectures;  thus,  components  for  building  a  Beowulf  cluster  to  run  pMatlab  and 
MatlabMPI  are  relatively  cheap  and  widely  available.  Figure  2  shows  the  differences  between 
multiprocessors  and  Beowulf  clusters. 
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Figure  2  -  Multiprocessors  often  connect  processors  with  high-performance  interconnects  in  a 
single  computer.  Beowulf  clusters  connect  commodity  computers  with  conventional  networks. 


4.  Computation  and  Communication 

Parallel  programming  improves  performance  by  breaking  down  a  problem  into  smaller  sub¬ 
problems  that  are  distributed  to  multiple  processors.  Thus,  the  benefits  are  two-fold.  First,  the 
total  amount  of  computation  performed  by  each  individual  processor  is  reduced,  resulting  in 
faster  computation.  Second,  the  size  of  the  problem  can  be  increased  by  using  more  memory 
available  on  multiple  processors. 

At  first  glance,  it  appears  that  adding  more  processors  will  cause  a  parallel  program  to  keep 
running  faster,  right?  Not  necessarily.  A  common  misconception  by  programmers  new  to 
parallel  programming  is  that  adding  more  processors  always  improves  performance  of  parallel 
programs.  There  are  several  basic  obstacles  that  restrict  how  much  parallelization  can  improve 
performance. 
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Figure  3  -  Serial  runtime  for  an  application  is  shown  on  top.  Fifty  percent  of  the  runtime  is  parallelizable. 
Amdahl’s  Law  states  that  even  if  the  parallel  section  is  reduced  to  0,  total  runtime  will  be  reduced  by  50%. 


4.1.  Amdahl’s  Law 

Amdahl’s  Law  was  originally  defined  for  processor  architectures,  but  applies  equally  well  to 
parallel  programming.  Suppose  50%  of  a  program’s  runtime  can  be  parallelized.  Amdahl’s  Law 
states  that  even  if  the  runtime  of  the  parallelized  portion  can  be  reduced  to  zero,  total  runtime  is 
only  reduced  by  50%,  as  shown  in  Figure  3.  Thus,  Amdahl’s  Law  computes  the  theoretical 
maximum  performance  improvement  for  parallel  programs.  See  Section  9  for  more  details. 

4.2.  Division  of  labor 

Dividing  and  distributing  a  problem  to  multiple  processors  incurs  an  overhead  absent  in  serial 
programs.  Parallel  programmers  accept  this  overhead  since,  ideally,  the  time  required  for  this 
overhead  is  much  less  than  the  time  saved  due  to  parallelization.  However,  as  the  number  of 
processors  increases,  so  does  the  overhead.  Eventually,  this  overhead  can  become  larger  than  the 
time  saved.  Programmers  must  strike  a  balance  between  the  time  spent  on  overhead  and  the  time 
saved  from  parallelizing  to  maximize  the  net  time  saved.  Much  of  the  overhead  incurred  by 
distributing  a  problem  is  caused  by  communication,  which  is  discussed  in  the  next  section. 

4.3.  Communication 

Almost  all  parallel  programs  require  some  communication.  Clearly  communication  is  required 
to  distribute  sub-problems  to  processors.  However,  even  after  a  problem  is  distributed, 
communication  is  often  required  during  processing,  e.g.  redistributing  data  or  synchronizing 
processors.  Regardless,  communication  adds  overhead  to  parallel  processing.  As  the  number  of 
processors  grows,  work  per  processor  decreases  but  communication  can  -  and  often  does  - 
increase.  It  is  not  unusual  to  see  performance  increase,  peak,  then  decrease  as  the  number  of 
processors  grows,  due  to  an  increase  in  communication.  This  is  referred  to  as  slowdown  or 
speeddown  and  can  be  avoided  by  carefully  choosing  the  number  of  processors  such  that  each 
processor  has  enough  work  relative  to  the  amount  of  communication.  Finding  this  balance  is  not 
generally  intuitive;  it  often  requires  an  iterative  process. 
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Figure  4  -  Example  of  how  combining  messages  to  reduce  communication  overhead  can  improve 
performance  incurring  communication  overhead  once  rather  than  multiple  times. 


One  way  to  mitigate  the  effects  of  communication  on  performance  is  to  reduce  the  total  number 
of  individual  messages  by  sending  a  few  large  messages  rather  than  sending  many  small 
messages.  All  networking  protocols  incur  a  fixed  amount  of  overhead  when  sending  or  receiving 
a  message.  Rather  than  sending  multiple  messages,  it  is  often  worthwhile  to  wait  and  send  a 
single  large  message,  thus  incurring  the  network  overhead  once  rather  than  multiple  times  as 
depicted  in  Figure  4.  Depending  on  the  communication  medium,  overhead  may  be  large  or 
small.  In  the  case  of  pMatlab  and  MatlabMPI,  overhead  per  message  can  be  large;  therefore,  it  is 
to  the  user’s  advantage  not  only  to  reduce  the  total  amount  of  communication  but  also  reduce  the 
total  number  of  messages,  when  possible. 

5.  Single-Program  Multiple-Data 

There  are  multiple  models  for  constructing  parallel  programs.  At  a  very  high  level,  there  are  two 
basic  models:  multiple-program  multiple-data  ( MPMD )  and  single-program  multiple-data 
0 SPMD ). 

In  the  MPMD  model,  each  processor  executes  a  different  program  (“multiple-program”),  with 
the  different  programs  working  in  tandem  to  process  different  data  (“multiple-data”).  Streaming 
signal  processing  chains  are  excellent  examples  of  applications  that  can  be  parallelized  using  the 
MPMD  model.  In  a  typical  streaming  signal  processing  chain,  data  streams  in  and  flows  through 
a  sequence  of  operations.  In  a  serial  program,  each  operation  must  be  processed  in  sequential 
order  on  a  single  processor.  In  a  parallel  program,  each  operation  can  be  a  separate  program 
executing  simultaneously  on  different  processors.  This  allows  the  parallel  program  to  process 
multiple  sets  of  data  simultaneously,  in  different  stages  of  processing. 

Consider  an  example  of  a  simulator  that  generates  simulated  data  to  test  a  signal  processing 
application,  shown  in  Figure  5.  The  simulator  is  structured  as  a  pipeline ,  in  which  data  flow  in 
one  direction  through  various  stages  of  processing.  Stages  in  the  pipeline  can  be  executed  on 
different  processors  or  a  single  stage  could  even  execute  on  multiple  processors.  If  two 
consecutive  stages  each  run  on  multiple  processors,  it  may  be  necessary  to  redistribute  data  from 
one  stage  to  the  next;  this  is  known  as  a  cornerturn. 
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Figure  5  -  Example  of  a  MPMD  signal  processing  chain.  Each  box  and  color  represents  a  stage  in 
the  pipeline  and  unique  set  of  processors,  respectively.  Curved  arrows  indicate  cornerturns. 


Figure  6  -  Each  arrow  represents  one  Monte  Carlo  simulation.  The  top  set  of  arrows  represents  serial 
execution  of  the  simulations.  The  bottom  set  of  arrows  represent  parallel  execution  of  the  simulations. 


Note  that  a  pipeline  does  not  necessarily  speed  up  processing  a  single  frame.  However,  since 
each  stage  executes  on  different  processors,  the  pipeline  can  process  multiple  frames 
simultaneously,  increasing  throughput. 

In  the  SPMD  model,  the  same  program  runs  on  each  processor  (“single-program”)  but  each 
program  instance  processes  different  data  (“multiple-data”).  Different  instances  of  the  program 
may  follow  different  paths  through  the  program  (e.g.  if-else  statements)  due  to  differences  in 
input  data,  but  the  same  basic  program  executes  on  all  processors.  pMatlab  and  MatlabMPI 
follow  the  SPMD  model.  Note  that  SPMD  programs  can  be  written  in  a  MPMD  manner  by 
using  if-else  statements  to  execute  different  sections  of  code  on  different  processors.  In  fact,  the 
example  shown  in  Figure  5  is  actually  a  pMatlab  program  written  in  an  MPMD  manner.  The 
scalability  and  flexibility  of  the  SPMD  model  has  made  it  the  dominant  parallel  programming 
model.  SPMD  codes  range  from  tightly  coupled  {fine-grain  parallelism)  simulations  of 
evolution  equations  predicting  chemical  processes,  weather  and  combustion,  to  “embarrassingly 
parallel”  {coarse- grain)  programs  searching  for  genes,  data  strings,  and  patterns  in  data 
sequences  or  processing  medical  images  or  signals. 

Monte  Carlo  simulations  are  well  suited  to  SPMD  models.  Each  instance  of  a  Monte  Carlo 
simulation  has  a  unique  set  of  inputs.  Since  the  simulations  are  independent,  i.e.  the  input  of  one 
is  not  dependent  on  the  output  of  another,  the  set  of  simulations  can  be  distributed  among 
processors  and  execute  simultaneously,  as  depicted  in  Figure  6. 
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Figure  7  -  In  OpenMP,  threads  are  spawned  at  parallel  regions  of  code  marked  by  the  programmer.  In  MPI, 

every  process  runs  the  entire  application. 


The  two  most  common  implementations  of  the  SPMD  model  are  the  Message  Passing  Interface 
(MPI)  and  OpenMP.  In  MPI,  the  program  is  started  on  all  processors.  The  program  is 
responsible  for  distributing  data  among  processors,  then  each  processor  processes  its  section  of 
the  data,  communicating  among  themselves  as  needed.  MPI  is  an  industry  standard  and  is 
implemented  on  a  wide  range  of  parallel  computers,  from  multiprocessor  to  cluster  architectures 
[5].  Consequently,  MPI  is  the  most  popular  technology  for  writing  parallel  SPMD  programs. 
More  details  of  MPI  are  discussed  in  the  next  section. 


The  OpenMP  model  provides  a  more  dynamic  approach  to  parallelism  by  using  a  fork- join 
approach  [6].  In  OpenMP,  programs  start  as  a  single  process  known  as  a  master  thread  that 
executes  on  a  single  processor.  The  programmer  designates  parallel  regions  in  the  program. 
When  the  master  thread  reaches  a  parallel  region,  a  fork  operation  is  executed  that  creates  a  team 
of  threads,  which  execute  the  parallel  region  on  multiple  processors.  At  the  end  of  the  parallel 
region,  a  join  operation  terminates  the  team  of  threads,  leaving  only  the  master  thread  to  continue 
on  a  single  processor. 

pMatlab  and  MatlabMPI  use  the  MPI  model,  thus  the  code  runs  on  all  processors  for  the  entire 
runtime  of  the  program.  Figure  7  graphically  depicts  the  difference  between  pMatlab/MatlabMPI 
and  OpenMP  programs.  At  first,  this  may  seem  unnecessarily  redundant.  Why  bother  executing 
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code  in  regions  that  do  not  require  parallelization  on  multiple  processors  that  will  generate  the 
same  result?  Why  not  execute  only  the  parallelized  code  on  multiple  processors,  like  OpenMP? 

Consider  initialization  code  (which  is  typically  non-parallel)  that  initializes  variables  to  the  same 
value  on  each  processor.  Why  not  execute  the  code  on  a  single  processor,  then  send  the  values  to 
the  other  processors?  This  incurs  communication  which  can  significantly  reduce  performance. 
If  each  processor  executes  the  initialization  code,  the  need  for  communication  is  removed.  Using 
a  single  processor  for  non-parallel  code  doesn’t  improve  performance  or  reduce  runtime  of  the 
overall  parallel  program. 

The  fork-join  model  works  with  OpenMP  because  OpenMP  runs  on  shared-memory  systems,  in 
which  every  processor  uses  the  same  physical  memory.  Therefore,  communication  between 
threads  does  not  involve  high  latency  network  transactions  but  rather  low-latency  memory 
transactions. 

6.  Message  Passing 

Within  the  SPMD  model,  there  are  multiple  styles  of  programming.  One  of  the  most  common 
styles  is  message  passing.  In  message  passing,  all  processes  are  independent  entities,  i.e.  each 
process  has  its  own  private  memory  space.  Processes  communicate  and  share  data  by  passing 


send  (A) 


msg(A) 


recv(A) 


recv(F) 


msg(F)  * 


send(E) 


send(F) 


msg(E) 


recv(E) 


Figure  8  -  Example  of  message  passing.  Each  color  represents  a  different  processor.  The  upper  half 
represents  program  flow  on  each  processor.  The  lower  half  depicts  each  processor’s  memory  space. 
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messages.  Communication  is  performed  explicitly  by  directly  invoking  send  and  receive 
operations  in  the  code,  as  shown  in  Figure  8.  The  most  popular  message  passing  technology  is 
the  Message  Passing  Interface  (MPI),  a  message  passing  library  for  C  and  Fortran.  MPI  was 
developed  with  the  goal  of  creating  a  message  passing  standard  that  would  be  portable  across  a 
wide  range  of  systems,  from  multiprocessors  to  Beowulf  clusters. 

MPI  is  a  basic  networking  library  tailored  to  parallel  systems,  enabling  processes  in  a  parallel 
program  to  communicate  with  each  other.  Details  of  the  underlying  network  protocols  and 
infrastructure  are  hidden  from  the  programmer.  This  helps  achieve  MPI’s  portability  mandate 
while  enabling  programmers  to  focus  on  writing  parallel  code  rather  than  networking  code. 

To  distinguish  sub-processes  from  each  other,  each  process  is  given  a  unique  identifier  called  a 
rank.  The  rank  is  the  primary  mechanism  MPI  uses  to  distinguish  processors  from  each  other. 
Ranks  are  used  for  various  purposes.  MPI  programs  use  ranks  to  identify  which  data  should  be 
processed  by  each  process  or  what  operations  should  be  run  on  each  process.  The  following  are 
a  couple  examples: 

•  Executing  different  sections  of  code 

if  (rank  ==  0) 

%  Rank  0  executes  the  if  block 
else 

%  All  other  ranks  execute  the  else  block 
end 

•  Processing  different  data 

%  Use  rank  as  an  index  into  'input'  and  'output'  arrays 
output  (rank)  =  my_function (input  (rank) ) ; 

6.1.  MatlabMPI 

MatlabMPI  is  a  MATLAB  implementation  of  the  MPI  standard  developed  at  the  Lincoln 
Laboratory  to  emulate  the  “look  and  feel”  of  MPI  [7].  MatlabMPI  is  written  in  pure  MATLAB 
code  (i.e.  no  MEX  files)  and  uses  MATLAB ’s  file  I/O  functions  to  perform  communication, 
allowing  it  to  run  on  any  system  that  MATLAB  supports.  MatlabMPI  has  been  used 
successfully  both  inside  and  outside  Lincoln  to  obtain  speedups  in  MATLAB  programs 
previously  not  possible.  Ligure  9  depicts  a  message  being  sent  and  received  in  MatlabMPI. 

Despite  MPI’s  popularity,  writing  parallel  programs  with  MPI  and  MatlabMPI  can  still  be 
difficult.  Explicit  communication  requires  careful  coordination  of  sending  and  receiving 
messages  and  data  that  must  be  divided  among  processors  can  require  complex  index 
computations.  The  next  section  describes  how  message  passing  can  be  used  to  create  a  new 
parallel  programming  model,  called  global  array  semantics,  that  hides  the  complexity  of  message 
passing  from  the  programmer. 
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Figure  9  -  MatlabMPI  implements  the  fundamental  communication  operations  in  MPI  using  MATLAB’s  file 

I/O  functions. 


7.  Global  Array  Semantics 

Message  passing  provides  an  extremely  flexible  programming  paradigm  that  allows 
programmers  to  parallelize  a  wide  range  of  data  structures,  including  arrays,  graphs,  and  trees. 
However,  data  in  signal  processing  applications  are  generally  stored  in  regular  data  structures, 
e.g.  vectors,  matrices,  tensors,  etc. 

Global  array  semantics  is  a  parallel  programming  model  in  which  the  programmer  views  an 
array  as  a  single  global  array  rather  than  multiple,  independent  arrays  located  on  different 
processors,  e.g  MPI.  The  ability  to  view  related  data  distributed  across  processors  as  a  single 
array  more  closely  matches  the  serial  programming  model,  thus  making  the  transition  from  serial 
to  parallel  programming  much  smoother. 

A  global  array  library  can  be  implemented  using  message  passing  libraries  such  as  MPI  or 
MatlabMPI.  The  program  calls  functions  from  the  global  array  library.  The  global  array  library 
determines  if  and  how  data  must  be  redistributed  and  calls  functions  from  the  message  passing 
library  to  perform  the  communication.  Communication  is  hidden  from  the  programmer;  arrays 
are  automatically  redistributed  when  necessary,  without  the  knowledge  of  the  programmer.  Just 
as  message  passing  libraries  like  MPI  allow  programmers  to  focus  on  writing  parallel  code  rather 
than  networking  code,  global  array  libraries  like  pMatlab  allow  programmers  to  focus  on  writing 
scientific  and  engineering  code  rather  than  parallel  code. 

7.1.  pMatlab 

pMatlab  brings  global  array  semantics  to  MATLAB,  using  the  message  passing  capabilities  of 
MatlabMPI  [8].  The  ultimate  goal  of  pMatlab  is  to  move  beyond  basic  messaging  (and  its 
inherent  programming  complexity)  towards  higher  level  parallel  data  structures  and  functions, 
allowing  any  MATLAB  user  to  parallelize  their  existing  program  by  simply  changing  and  adding 
a  few  lines,  rather  than  rewriting  their  program,  as  would  likely  be  the  case  with  MatlabMPI. 
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See  Figure  10  for  a  comparison  of  implementing  an  FFT  and  cornerturn  in  MatlabMPI  and 
pMatlab. 

Clearly,  pMatlab  allows  the  scientist  or  engineer  to  access  the  power  of  parallel  programming 
while  focusing  on  the  science  or  engineering.  Figure  11  depicts  the  hierarchy  between  pMatlab 
applications,  the  pMatlab  and  MatlabMPI  libraries,  MATLAB  and  the  parallel  hardware. 


if  (my_rank==0 )  |  (my_rank==l )  |  (my_rank==2)  |  (my_rank==3) 

A_local=rand (M,  N/4 )  ; 
end 

if  (my_rank==4)  |  (my_rank==5)  |  (my_rank==6)  |  (my_rank==7) 

B_local=zeros (M/4 , N) ; 
end 

tag  =  0; 

if  (my_rank==0)  |  (my_rank==l)  |  (my_rank==2)  |  (my_rank==3) 

A_local=fft (A_local) ; 
for  ii  =  0:3 

MPI_Send (ii+4,  tag,  comm,  A_local (ii*M/4  +  1 : ( ii+1 ) *M/4 , : ) ) ; 

end 

end 

if  (my_rank==4)  |  (my_rank==5)  |  (my_rank==6)  |  (my_rank==7) 

for  ii  =  0:3 

B_local ( : ,  ii*N/4  +  1 : (ii+1 ) *N/4 )  =  MPI_Recv(ii,  tag,  comm) ; 

end 

end _ 

mapA  =  map ( [ 1  4],{},  [0:3]); 
mapB  =  map ([4  1],{},  [4:7]); 

A  =  rand (M,  N, mapA) ; 

B  =  zeros (M, N, mapB) ; 

B  (  :  ,  :  )  =  fft(A); _ 

Figure  10  -  MatlabMPI  (top)  and  pMatlab  (bottom)  implementations  of  an  FFT  and  cornerturn. 


Parallel 

Hardware 
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Figure  11  -  Parallel  MATLAB  consists  of  two  layers.  pMatlab  provides  parallel  data  structures  and 
library  functions.  MatlabMPI  provides  messaging  capability. 
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D  = 

ones (M,  P) ; 

o, 

o 

Create  a 

MxM  dmat  object  of  ones  with  map  P 

D  = 

zeros (M,  N,  R, 

P)  ; 

o, 

o 

Create  a 

MxNxR  dmat  object  of 

zeros  using 

o, 

o 

map  P 

D  = 

rand (M,  N,  R, 

S,  P); 

o, 

o 

Create  a 

MxNxRxS  dmat  object 

of  random 

o, 

o 

numbers  ' 

with  map  P 

D  = 

spalloc (M,  N, 

X,  P); 

o, 

o 

Create  a 

MxN  sparse  dmat  with 

room  to  hold 

o, 

o 

X  non-zero  values  with  map  P 

Figure  12  -  Examples  of  dmat  constructor  functions. 


8.  Introduction  to  pMatlab 

This  section  introduces  the  fundamentals  of  pMatlab:  creating  data  structures;  writing  and 
launching  applications;  and  testing,  debugging  and  scaling  your  application.  A  tutorial  for 
pMatlab  can  be  found  in  [9].  For  details  on  functions  presented  here  and  examples,  refer  to  [10]. 

8.1.  Distributed  Matrices 

pMatlab  introduces  a  new  datatype:  the  distributed  matrix,  or  dmat.  The  dmat  is  the 
fundamental  data  storage  datatype  in  pMatlab,  equivalent  to  double  in  MATLAB.  pMatlab  can 
support  dmat  objects  with  two,  three,  or  four  dimensions,  dmat  objects  must  be  explicitly 
constructed  in  pMatlab  via  a  constructor  function. 

Currently,  pMatlab  overloads  four  MATLAB  constructor  functions:  zeros,  ones,  rand,  and 
spalloc.  Each  of  these  functions  accepts  the  same  set  of  parameters  as  their  corresponding 
MATLAB  functions,  with  the  addition  of  a  map  parameter.  The  map  parameter  accepts  a  map 
object  which  describes  how  to  distribute  the  dmat  object  across  multiple  processors. 

Figure  12  contains  examples  of  how  to  create  dmats  of  varying  dimensions.  Section  8.2 
describes  map  objects  and  how  to  create  them  in  greater  detail.  Section  8.3  discusses  how  to 
program  with  dmats.  Refer  to  [10]  for  details  on  each  constructor  function. 

8.2.  Maps 

For  a  dmat  object  to  be  useful,  the  programmer  must  tell  pMatlab  how  and  where  the  dmat  must 
be  distributed.  This  is  accomplished  with  a  map  object.  To  create  a  dmat,  the  programmer  first 
creates  a  map  object,  then  calls  one  of  several  constructor  functions,  passing  in  the  dmat’s 
dimensions  and  the  map  object. 

Each  map  object  has  four  components: 

•  The  grid  is  a  vector  of  integers  that  specifies  how  each  dimension  of  a  dmat  is  broken  up. 
For  example,  if  the  grid  is  [  2  3  ] ,  the  first  dimension  is  broken  up  between  2  processors 
and  the  second  dimension  is  broken  up  between  3  processors,  as  shown  in  Figure  13. 

•  The  distribution  specifies  how  to  distribute  each  dimension  of  the  dmat  among 
processors.  There  are  three  types  of  distributions: 

o  Block  -  Each  processor  contains  a  single  contiguous  block  of  data 
o  Cyclic  -  Data  are  interleaved  among  processors 
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o  Block-cyclic  -  Contiguous  blocks  of  data  are  interleaved  among  processors.  Note 
that  block  and  cyclic  are  special  cases  of  block-cyclic.  Block-cyclic  distributions 
require  the  user  to  also  specify  the  block  size. 

Figure  14  shows  an  example  of  the  same  dmat  distributed  over  four  processors  using 
each  of  the  three  types  of  data  distributions. 

•  The  processor  list  specifies  on  which  ranks  the  object  should  be  distributed.  Ranks  are 
assigned  column-wise  (top-down,  then  left-right)  to  grid  locations  in  sequential  order 
(Figure  15). 

•  Overlap  is  a  vector  of  integers  that  specify  the  amount  of  data  overlap  between 
processors  for  each  dimension.  Only  block  distributions  can  have  overlap.  Overlap  is 
useful  in  situations  when  data  on  the  boundary  between  two  processors  are  required  by 
both,  e.g.  convolution.  In  these  cases,  using  overlap  copies  boundary  data  onto  both 
processors. 

Figure  16  shows  an  example  of  a  dmat  distributed  column- wise  across  four  processors 
with  1  column  of  overlap  between  adjacent  processors. 

The  following  are  examples  of  how  to  create  various  types  of  maps: 

•  2D  map,  1x2  grid,  block  along  rows,  cyclic  along  columns: 


gridl  =  [1  2  ]  ; 
distl(l) .dist  =  ' b ' ; 
distl(2) .dist  =  'c'; 
procl  =  [0:1] ; 
mapl  =  map (gridl,  distl. 


%  1x2  grid 

%  block  along  dim  1  (rows) 

%  cyclic  along  dim  2  (columns) 
%  list  of  ranks  0  and  1 
procl )  ; 


•  3D  map,  2x3x2  grid,  block-cyclic  along  rows  and  columns  with  block  size  2,  cyclic  along 
third  dimension: 


grid2  =  [2  3  2 ] ; 
dist2(l) .dist  =  'be'; 
dist2(l) ,b_size  =  2; 
dist2(2) .dist  =  'be'; 
dist2(2) ,b_size  =  2; 
dist2  (3)  .dist  =  'c'; 
proc2  =  [0:11] ; 
map2  =  map(grid2,  dist2. 


%  2x3x2  grid 

%  block-cyclic  along  dim  1  (rows) 

%  block  size  2  along  dim  1  (rows) 

%  block-cyclic  along  dim  2  (columns) 
%  block  size  2  along  dim  2  (columns) 
%  cyclic  along  dim  3 
%  list  of  ranks  0  through  11 
proc2 )  ; 


•  2D  map,  2x2  grid,  block  along  both  dimensions,  overlap  in  the  column  dimension  of  size 
1(1  column  overlap).  This  example  shows  how  to  specify  a  map  in  a  single  line  of  code: 

%  param  1 :  2x2  grid 

%  param  2:  block  in  all  dimensions 
%  param  3:  list  of  ranks  0  and  1 

%  param  4:  overlap  of  0  along  dim  1,  overlap  of  1  along  dim  2 
map 3  =  map  ([2  2],  {},  [0  1],  [0  1]); 
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Figure  13  -  Example  of  a  map  grid. 


Figure  14  -  Examples  of  different  data  distributions. 


Figure  15  -  Example  of  how  processors  are  assigned  to  grid  locations. 
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Figure  16  -  Example  of  data  overlap  between  processors. 


8.3.  Working  with  Distributed  Matrices 

Using  MATLAB’s  function  overloading  feature,  pMatlab  is  able  to  provide  new 
implementations  of  existing  MATLAB  functions.  This  capability  was  demonstrated  in  Section 

8.1.  However,  pMatlab  does  not  limit  itself  to  overloading  just  constructors,  but  other  functions 
as  well.  These  overloaded  functions  provide  the  same  functionality  as  their  MATLAB 
equivalents  but  operate  on  the  dmats  rather  than  doubles.  In  this  way,  pMatlab  provides  an 
interface  that  is  nearly  identical  to  MATLAB  while  abstracting  the  parallel  computation  and 
communication  from  the  user.  While  some  functions  require  additional  input,  such  as  the  map 
parameter  for  the  dmat  constructor  functions  discussed  in  Section  8.1,  most  functions  look 
identical  to  their  MATLAB  counterparts. 

pMatlab  overloads  a  number  of  basic  MATLAB  functions.  Most  of  these  overloaded  functions 
implement  only  a  subset  of  the  available  functionality  of  each  function.  For  example, 
MATLAB  ’s  zeros  function  allow  matrix  dimensions  to  be  specified  by  arbitrarily  long  vectors 
while  the  pMatlab  zeros  function  does  not. 

Occasionally,  pMatlab  users  may  wish  to  directly  manipulate  a  dmat  object.  Thus,  pMatlab 
provides  additional  functions  that  allow  the  user  to  access  the  contents  of  dmat  object  without 
using  overloaded  pMatlab  functions. 

This  section  will  introduce  various  classes  of  functions  available  in  pMatlab  and  touch  on  the 
major  differences  between  MATLAB  and  pMatlab  functions. 

8.3.1.  Subscripted  Assignment  and  Reference 

Two  of  the  most  fundamental  MATLAB  operations  are  subscripted  assignment  (subs as gn)  and 
reference  (subsref).  Initially,  both  operators  appear  similar  in  that  they  are  both  invoked  by 
the  parentheses  operator,  ( ) ,  and  access  subsections  of  a  matrix.  However,  the  degrees  to  which 
they  duplicate  MATLAB’s  built-in  operations  are  substantially  different. 


N  =  1000000; 

%  A  and  B  have  different  maps 
mapA  =  map ( [ 1  Np] ,  {},  0:Np-l); 

mapB  =  map ( [Np  1],  {},  0:Np-l); 

A  =  zeros  (N,  mapA) ; 
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B  =  rand(N,  mapB); 

%  Invokes  pMatlab' s  subsasgn. 

%  Data  in  B  is  remapped  to  mapA  and  assigned  to  A. 

A (  :  ,  : )  =  B; 

%  Performs  a  memory  copy. 

%  A  is  replaced  with  a  copy  of  B,  i.e.  A  has  the  same  data  AND  mapping 
%  as  B. 

A  =  B 

Figure  17  -  Examples  of  when  pMatlab’s  subsasgn  is  and  is  not  invoked. 


%  Fully  supported 

A ( : , : )  %  Entire  matrix  reference 

A  ( i , j )  %  Single  element  reference 

%  Partially  supported 

A(i:k,  j : 1 )  %  Arbitrary  submatrix  reference 

Figure  18  -  Examples  of  fully  supported  and  partially  supported  cases  subsref  in  pMatlab. 


Subscripted  assignment 

pMatlab’s  subsasgn  operator  supports  nearly  all  capabilities  of  MATLAB’s  built-in  operator. 
pMatlab’s  subsasgn  can  assign  any  arbitrary  subsection  of  a  dmat.  If  a  subsection  spans 
multiple  processors,  subsasgn  will  automatically  write  to  the  indices  on  the  appropriate 
processors. 

subsasgn  in  pMatlab  has  one  restriction.  In  MATLAB,  the  parentheses  operator  can  be 
omitted  when  assigning  to  the  entire  matrix.  In  pMatlab,  the  parentheses  operator  must  be  used 
in  this  case.  Performing  an  assignment  without  ( )  simply  performs  a  memory  copy  and  the 
destination  dmat  will  be  identical  -  including  the  mapping  -  to  the  source;  data  will  not  be 
redistributed.  See  Figure  17  for  an  example. 

Subscripted  reference 

The  capabilities  of  pMatlab’s  subsref  operator  are  much  more  limited  than  subsasgn. 
subsref  returns  a  “standalone”  dmat  only  in  the  following  cases: 

•  A  ( : ,  : )  —  Refers  to  the  entire  contents  of  the  dmat  A 

•  A (i,  j )  -  Refers  to  a  single  element  in  the  dmat  A. 

Referring  to  an  arbitrary  subsection  will  not  produce  a  “standalone”  dmat,  i.e.  the  resulting  dmat 
can  not  be  directly  used  as  an  input  to  any  pMatlab  function,  with  the  exception  of  local.  The 
details  are  outside  the  scope  of  this  document;  this  is  a  limitation  of  pMatlab  and  addressing 
these  issues  requires  further  research.  For  more  information  about  the  local  function,  see 
Section  8.3.4.  Figure  18  contains  examples  of  fully  and  partially  supported  cases  of  subsref. 
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8.3.2.  Arithmetic  Operators 

A  number  of  arithmetic  operators,  such  as  addition  (+),  matrix  multiplication  (*),  element-wise 
multiplication  ( .  *),  and  equality  (==)  have  been  overloaded.  These  operators  are  used  the  same 
way  as  MATLAB.  Figure  19  show  several  examples  of  various  arithmetic  operators  in  pMatlab. 

8.3.3.  Mathematical  Functions 

Several  mathematical  functions,  including  simple  functions  such  as  absolute  value  (abs)  and 
advanced  functions  such  as  fft  have  been  overloaded  in  pMatlab.  Most  of  these  functions  are 
called  the  same  way  as  in  MATLAB.  Some  pMatlab  functions  may  not  support  the  entire  syntax 
supported  by  the  equivalent  MATLAB  function,  but  all  will  support  at  least  the  basic  syntax. 
Figure  20  shows  several  examples  of  how  to  use  various  mathematical  functions  in  pMatlab. 


N  =  1000000; 

Ml  =  map ( [1  Np] ,  { 

}, 

0  :  Np-1 ) ; 

M2  =  map ( [Np  1] ,  { 

1, 

0  :  Np-1 ) ; 

%  Add  two  dmats 

A  =  rand (N,  Ml ) ; 

o, 

o 

NxN  dmat  mapped  to  Ml 

B  =  rand  (N,  Ml ) ; 

o, 

o 

NxN  dmat  mapped  to  Ml 

Cl  =  A  +  B; 

g, 

o 

Result  is  mapped  to  Ml 

C2  =  zeros (N,  M2); 

o, 

o 

NxN  dmat  mapped  to  M2 

C2 ( : ,  :  )  =  A  +  B; 

o, 

o 

Result  is  remapped  to  M2 

%  Matrix  multiply 

dmat  and  a  double 

D  =  rand  (N,  Ml ) ; 

o, 

o 

NxN  dmat  mapped  to  Ml 

E  =  rand (N ) ; 

g, 

o 

NxN  double 

FI  =  D  *  E; 

g, 

o 

Result  is  mapped  to  Ml 

F2  =  zeros (N,  M2); 

g, 

o 

NxN  dmat  mapped  to  M2 

F2 ( :  ,  : )  =  D  *  E; 

g, 

o 

Result  is  remapped  to  M2 

%  Check  for  equality  ’ 

between  two  dmats 

G  =  rand (N,  Ml ) ; 

o, 

o 

NxN  dmat  mapped  to  Ml 

H  =  rand (N,  Ml ) ; 

o, 

o 

NxN  dmat 

at 

ll 

ll 

o 

ll 

\ — 1 

H 

g, 

o 

Result  is  mapped  to  Ml 

12  =  zeros (N,  M2); 

g, 

o 

NxN  dmat  mapped  to  M2 

12 ( :  ,  : )  =  (G  ==  H) 

•  2- 

f  ° 

Result  is  remapped  to  M2 

Figure  19  -  Examples  of  how  to  use  arithmetic  operators  with  dmats 


N  =  1000000; 

Ml  =  map ( [ 1  Np]  ,  {},  0:Np-l); 

M2  =  map ( [Np  1],  {},  0:Np-l); 

%  Absolute  value 

A  =  rand (N,  Ml);  %  NxN  dmat  mapped  to  Ml 
B1  =  abs (A) ;  %  Result  is  mapped  to  Ml 
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B2  =  zeros (N,  M2); 

o, 

o 

NxN  dmat  mapped  to  M2 

B2 ( : , : )  =  abs (A) ; 

o. 

o 

Result  is  remapped  to  M2 

%  FFT 

D  =  rand  (N,  Ml ) ; 

o, 

o 

NxN  dmat  mapped  to  Ml 

El  =  fft  (D)  ; 

o. 

o 

Result  is  mapped  to  Ml 

E2  =  zeros  (N,  M2); 

o, 

o 

NxN  dmat  mapped  to  M2 

E2  (  :  ,  :  )  =  fft (D)  ; 

o, 

o 

Result  is  remapped  to  M2 

Figure  20  -  Examples  of  how  to  use  mathematical  functions  with  dmats 


8.3.4.  Global  vs.  Local  Scope 

The  notion  of  matrices  distributed  across  multiple  processors  raises  the  issue  of  global  and  local 
scopes  of  a  dmat.  The  global  scope  of  a  dmat  refers  to  the  matrix  as  a  single  entity,  while  local 
scope  refers  to  only  the  section  of  the  dmat  residing  on  the  local  processor.  pMatlab  includes 
various  functions  that  specifically  address  the  global  and  local  scopes  of  dmats.  These  functions 
allow  the  user  a  greater  degree  of  control  over  dmat  objects,  but  must  be  used  with  caution. 

The  local  and  put_local  functions  give  the  user  direct  access  to  the  local  contents  of  dmat 
objects.  Each  processor  in  a  dmat’s  map  contains  a  section  of  the  dmat.  local  returns  a  matrix 
that  contains  the  section  of  the  dmat  that  resides  on  the  local  processor.  Conversely, 
put_local  writes  the  contents  of  a  matrix  into  that  processor’s  section  of  the  dmat.  The 
local  and  put_local  functions  allow  the  user  to  perform  operations  not  implemented  in 
pMatlab  while  still  taking  advantage  of  pMatlab’ s  ability  to  distribute  data  and  computation. 

When  local  returns  the  local  portion  of  a  dmat,  the  resulting  matrix  loses  global  index 
information,  i.e.  which  indices  in  the  global  map  it  maps  to.  See  Figure  21  for  an  example. 
Sometimes  it  is  necessary  for  each  processor  to  know  which  global  indices  it  owns.  pMatlab 
contains  functions  that  compute  what  global  indices  each  processor  owns  for  a  given  dmat: 

•  global_block_range/global_block_ranges  -  Returns  the  first  and  last  indices  of 
a  continuous  block  of  indices  owned  by  local  processor  and  all  processors,  respectively. 
Works  for  block  distributed  dmats  only. 

•  global_ind/global_inds  -  Returns  the  indices  owned  by  the  local  processor  and  all 
processors,  respectively. 

•  global_range/global_ranges  -  Returns  the  first  and  last  indices  for  multiple 
blocks  of  indices  owned  by  the  local  processor  and  all  processors,  respectively.  Works 
for  all  distributions. 
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Figure  21  -  Comparison  of  global  and  local  indices  in  a  dmat. 


1:  %  RUN . m  is  a  generic  script  for  running  pMatlab  scripts. 

2  : 

3:  %  Define  number  of  processors  to  use 

4 :  Ncpus  =  4 ; 

5: 

6:  %  Name  of  the  script  you  want  to  run 

7:  mFile  =  ' sample_application ' ; 

8  : 

9:  %  Define  cpus 

10  : 

11:  %  Run  on  user's  local  machine 

12 :  %  cpus  =  {  }  ; 

13: 

14:  %  Specify  which  machines  to  run  on 

15:  %  cpus  =  {'nodel.ll.mit.edu',  ... 

16:  'node2.ll.mit.edu',  ... 

17:  'node3.ll.mit.edu',  ... 

18:  'node4.ll.mit.edu'}; 

19: 

10  : 

21:  %  Run  the  script. 

22:  ['Running:  '  mFile  '  on  '  num2str (Ncpus )  '  cpus'] 

23:  eval (pRUN (mFile,  Ncpus,  cpus)); 


Figure  22  -  Example  script  for  launching  pMatlab  applications. 
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8.4.  Launching  pMatlab 

The  recommended  method  of  launching  pMatlab  programs  is  to  use  a  launch  script  similar  to  the 
one  in  Figure  22.  The  launch  script  must  be  located  and  run  in  the  same  directory  as  the 
pMatlab  application. 

Line  4  defines  the  Ncpus  variable,  which  specifies  the  how  many  processors  to  run  on. 

Lines  7  defines  the  mfile  variable,  which  specifies  which  pMatlab  program  to  run.  mfile 
should  be  set  to  the  filename  of  the  pMatlab  program  without  the  .  m  extension. 

Lines  12  and  15-18  are  examples  of  different  ways  to  define  the  cpus  variable,  which  specifies 
where  to  launch  the  pMatlab  program.  The  user  should  uncomment  the  line(s)  he  wishes  to  use. 

•  Line  12  directs  pMatlab  to  launch  all  MATLAB  processes  on  the  user’s  local  machine; 
this  is  useful  for  debugging  purposes,  but  Ncpus  should  be  set  to  a  small  number  (i.e.  less 
than  or  equal  to  4)  to  prevent  overloading  the  local  machine’s  processor. 

•  Lines  15-18  directs  pMatlab  to  launch  the  MATLAB  processes  on  the  machines  in  the 
list.  If  Ncpus  is  smaller  than  the  list,  pMatlab  will  use  the  first  Ncpus  machines  in  the 
list.  If  Ncpus  is  larger  than  the  list,  pMatlab  will  wrap  around  to  the  beginning  of  the  list. 

These  are  the  basic  ways  to  launch  pMatlab;  for  a  more  details,  please  refer  to  pRUN  and 
MPI_Run  in  [10]. 

Line  22  displays  a  statement  indicating  the  pMatlab  program  is  about  to  be  launched. 

Line  23  launches  the  pMatlab  program  specified  by  mfile  on  Ncpus  processors,  using  the 
machines  specified  in  cpus. 

Note  that  program  specific  parameters  can  not  be  passed  into  pMatlab  applications,  via 
pRUN.  Program  parameters  should  be  implemented  as  variables  initialized  at  the  beginning  of 
the  pMatlab  application  specified  in  mfile,  such  as  line  1  in  the  pMatlab  code  shown  in  Figure 
23,  which  will  be  explained  in  the  next  section. 

8.5.  A  Simple  pMatlab  Application 

This  section  will  describe  a  very  simple  pMatlab  application  that  performs  a  parallel  FFT  shown 
in  Figure  23.  This  example  can  be  found  in  the  examples  directory  in  the  pMatlab  library. 

Line  1  sets  the  size  of  the  input  matrix  to  the  FFT. 

Line  4  enables  or  disables  the  pMatlab  library.  If  parallel  is  set  to  0,  then  the  script  will  not 
call  any  pMatlab  functions  or  create  any  pMatlab  data  structures  and  will  run  serially  on  the  local 
processor.  If  parallel  is  set  to  1,  then  the  pMatlab  library  will  be  initialized,  dmats  will  be 
created  instead  of  regular  MATLAB  matrices,  and  any  functions  that  accept  dmat  inputs  will 
call  the  overloaded  pMatlab  functions. 
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Line  7  initializes  the  maps,  mapX  and  mapY,  for  the  input  and  output  matrices,  respectively.  If 
parallel  is  set  to  0,  then  mapX  and  mapY  are  set  to  1. 


1:  N  =  2^10;  %  NxN  Matrix  size. 

2  : 

3:  %  Turn  parallelism  on  or  off. 

4:  PARALLEL  =1;  %  Can  be  1  or  0.  OK  to  change. 

5: 

6:  %  Create  Maps. 

7:  mapX  =  1;  mapY  =  1; 

8:  if  (PARALLEL) 

9:  %  Break  up  channels. 

10:  mapX  =  map([l  Ncpus],  {},  0:Ncpus-l); 

11:  mapY  =  map([l  Ncpus],  {},  0:Ncpus-l); 

12 :  end 

13: 

14:  %  Allocate  data  structures. 

15:  X  =  rand (N, N, mapX) ; 

16:  Y  =  zeros (N,  N,  mapY )  ; 

17  : 

18:  %  Do  fft.  Changes  Y  from  real  to  complex. 

19:  Y (:,:)=  fft  (X)  ; 

20  : 

21:  %  Finalize  the  pMATLAB  program 

22:  disp (' SUCCESS ') ; 

Figure  23  -  A  parallel  FFT  implemented  in  pMatlab. 

If  parallel  is  set  1,  then  lines  8  through  12  are  executed.  Lines  10  and  11  create  map  objects 
for  the  dmats  X  and  Y.  The  maps  distribute  the  dmats  column- wise. 

Lines  15  and  16  construct  the  dmats  x  and  Y.  x  is  an  nxn  dmat  of  random  values  and  Y  is  an 
nxn  dmat  of  zeros. 

Line  19  calls  fft  on  x  and  assigns  the  output  to  Y.  Since  x  is  a  dmat,  rather  than  call  the  built- 
in  fft,  MATLAB  calls  pMatlab’s  fft  that  is  overloaded  for  dmat. 

Lines  28  through  30  finalizes  the  execution  of  the  pMatlab  library  if  parallel  is  set  to  1. 

8.6.  Interactive  Use  of  pMatlab 

Before  discussing  pMatlab’s  interactive  capabilities,  it  is  important  that  we  define  “interactive” 
with  respect  to  MATLAB  and  pMatlab  and  how  these  definitions  differ.  In  MATLAB,  the  user 
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Cluster  machines 


Figure  24  -  Interactively  running  pMatlab  applications  on  both  the  user’s  computer  and  a  cluster  of 
machines.  I’id=0  runs  on  the  user’s  machine  and  the  remaining  ranks  run  on  the  cluster. 

command  prompt  or  by  invoking  scripts  and  functions  that  contain  MATLAB  code.  The  former 
is  MATLAB’ s  definition  of  “interactive.” 

In  pMatlab,  the  user  can  issue  pMatlab  commands  only  by  invoking  scripts  or  functions  that 
contain  pMatlab  code.  pMatlab  commands  can  not  be  run  directly  from  the  command 
prompt.  Thus,  pMatlab  does  not  support  MATLAB’ s  notion  of  “interactive”  use. 

However,  pMatlab  does  support  a  different  type  of  “interactivity.”  Traditionally,  running 
parallel  programs  requires  logging  into  a  “submit”  node  of  a  parallel  computer,  writing  a  submit 
script  (containing  the  program  name,  parameters,  and  number  of  processors),  and  submitting  the 
script  to  a  job  scheduler.  Once  submitted,  jobs  often  must  wait  in  a  job  queue  before  they  are 
executed.  The  job  scheduler  e-mails  the  user  when  the  program  has  finished.  This  process  is 
known  as  batch  processing  and  is,  by  nature,  highly  non-interactive. 

While  pMatlab  still  supports  limited  batch  processing,  it  introduces  a  level  of  interactivity  to 
parallel  programming  previously  unseen.  To  launch  pMatlab  jobs,  a  user  first  starts  MATLAB 
on  his  personal  machine,  then  runs  the  run  .  m  script,  such  as  the  one  presented  in  Figure  22. 
When  the  pMatlab  job  starts  it  not  only  launches  MATLAB  processes  on  remote  machines  but  it 
also  uses  the  existing  MATLAB  process,  known  as  the  leader  process,  on  user’s  personal 
machine  as  well,  as  shown  in  Figure  24.  Consequently,  unlike  the  traditional  batch  process 
model  of  submitting  a  parallel  job,  waiting  for  a  notification  of  completion,  transferring  files  to  a 
desktop  and  then  post-processing  the  results,  pMatlab  users  are  able  to  use  their  personal  display 
and  utilize  MATLAB’s  graphics  capabilities  as  the  job  processes  or  completes. 

8.7.  Developing,  Testing,  Debugging,  and  Scaling 

Though  pMatlab  removes  a  great  deal  of  the  complexity  associated  with  parallel  code 
development,  the  process  is  still  inherently  more  difficult  than  serial  coding.  As  with  serial 
coding  there  are  software  development  strategies  which  lead  to  simplified  code  development.  In 
addition,  parallel  coding  raises  the  issues  of  scaling  to  more  processors  and  receiving  debugging 
information  from  multiple  nodes,  which  do  not  arise  in  serial  implementations. 
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8.7.1.  Developing 

There  are  several  guidelines  that  pMatlab  users  should  follow  when  developing  pMatlab 
applications.  We  list  the  most  important  below. 

Write  code  that  can  be  run  independent  of  pMatlab 

A  general  tenet  of  good  software  engineering  is  that  computer  code  should  be  as  modular  as 
possible.  The  easier  it  is  to  enable  and  disable  portions  of  the  simulation,  the  easier  it  is  to 
pinpoint  the  section  which  holds  an  error.  In  addition,  the  ability  to  quickly  remove  and  replace 
a  specific  module  allows  for  easy  updates  of  existing  code. 

pMatlab  code  is  no  exception  and  should  be  written  in  such  a  manner  that  the  pMatlab  library 
can  be  easily  disabled  to  run  the  application  serially,  i.e.  on  a  single  processor.  Developing 
applications  in  this  manner  assists  the  user  in  locating  bugs.  In  parallel  computing  there  are 
generally  two  types  of  errors,  those  resulting  from  algorithmic  choice  and  those  resulting  from 
parallel  implementation.  In  all  cases  it  is  recommended  that  the  code  be  run  in  serial  first  to 
resolve  any  algorithmic  issues.  After  the  code  has  been  debugged  in  serial,  any  new  bugs  in  the 
parallel  code  can  be  assumed  to  arise  from  the  parallel  implementation.  Being  able  to  easily 
enable  and  disable  the  pMatlab  library  can  help  the  user  determine  if  a  bug  is  caused  by  the 
algorithm  or  by  parallelization.  Figure  22  provides  an  example  of  this  development  approach, 
using  the  PARALLEL  variable. 

Write  scalable  code 

Avoid  writing  code  that  depends  on  a  specific  number  of  processors.  When  writing  parallel 
code,  remember  that  along  with  speedup  you  gain  the  ability  to  work  on  larger  problems.  The 
problem  that  fits  on  four  processors  today  may  grow  to  fit  on  eight  next  year  and  so  on.  If  your 
pMatlab  application  requires  a  fixed  number  of  processors,  it  is  impossible  to  scale  the 
simulation  up  (or  down)  without  editing,  rebuilding  and  retesting  the  code.  In  most  cases  writing 
scalable  pMatlab  code  is  relatively  straightforward  as  the  parallelism  is  implicitly  controlled  by 
the  library  rather  than  explicitly  controlled  by  the  user,  unlike  libraries  such  as  MPI  and 
MatlabMPI. 

The  above  practice  of  avoiding  the  use  of  hard  coded  numbers  of  processors  applies  to  all  aspects 
of  the  code.  The  more  the  code  is  parameterized  the  less  rewriting,  rebuilding,  retesting  is 
required.  Thus,  when  initializing  pMatlab,  make  sure  to  obtain  the  number  of  processors  from 
the  pMATLAB  variable  and  store  it  to  a  variable  like  Ncpus.  Use  this  variable  when  creating 
maps  in  order  to  create  scalable,  flexible  code.  See  Figure  22  for  an  example. 

Avoid  Pid-dependent  code 

Traditionally  parallel  programmers  had  to  keep  track  of  the  data  distribution,  and  occasionally 
needed  to  create  Pid  dependent  code,  e.g.  writing  if-else  statements  that  execute  different 
sections  of  code  on  different  processors.  However,  pMatlab  has  been  designed  to  alleviate  all  of 
that  for  the  programmer.  The  main  philosophy  behind  pMatlab  is  to  hide  the  parallelism  from 
the  user  as  much  as  possible.  The  motivation  for  pMatlab  was  to  relieve  programmers  of  the 
burden  of  explicitly  parallelizing  their  code  as  required  in  MPI  and  MatlabMPI. 
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%  Let  D  be  of  type  dmat . 

%  D_agg  contains  entire  contents  of  D  on  rank  0. 
%  D_agg  is  empty  on  all  other  ranks. 

D_agg  =  agg(D); 

if  (Pid  ==  0) 

%  Process  D_agg  on  Pid  0 

end 


%  Save  "results"  variable  to  " result s . <Pid> . mat " 
filename  =  ['results.'  num2str (Pid) ] ; 
save (filename,  'results'); 

Figure  25  -  Examples  of  “safe”  rank-dependent  pMatlab  code. 

pMatlab  users  should  not  need  to  write  code  that  depends  on  the  Pid  and  should  indeed  avoid  it 
as  rPid-dependent  code  can  potentially  break  or  cause  unexpected  behavior  in  pMatlab  programs. 

There  are  some  exceptions  to  this  rule.  One  is  related  to  the  agg  function  which  returns  the 
entire  contents  of  a  dmat  to  the  leader  processor  but  returns  only  the  local  portion  of  matrix  to 
the  remaining  processors.  There  may  be  cases  where  the  user  wishes  to  operate  on  the 
aggregated  matrix  without  redistributing  it.  In  this  case,  the  code  that  operates  on  the  aggregated 
matrix  might  be  within  an  if  block  than  runs  on  only  the  leader  processor.  Note  that  Pid  0  is 
chosen  as  the  leader  processor  because  there  will  always  be  a  Pid  0,  even  when  running  on  a 
single  processor. 

The  other  main  exception  to  writing  Pid  dependent  code  is  associated  with  I/O.  If  the  user 
wishes  each  process  in  the  pMatlab  program  to  save  its  results  to  a  .mat  file,  then  each 
processor  must  save  its  results  to  a  unique  filename.  The  simplest  way  to  accomplish  this  is  to 
include  the  rank  in  the  filename  as  each  rank  is  unique.  Figure  25  provides  examples  of  “safe” 
Pid-dependent  pMatlab  code. 

8.7.2.  Testing 

When  a  serial  MATLAB  program  fails,  the  user  is  notified  directly  via  an  error  message  (and 
usually  a  beep)  at  the  MATLAB  command  prompt.  Failures  in  pMatlab  programs  can  be  much 
more  devious  and  subtle.  A  failure  in  just  one  MATLAB  process  can  cause  the  entire  program  to 
hang.  For  instance,  suppose  a  pMatlab  program  contains  a  cornertum,  which  requires  all 
processes  to  communicate  with  all  other  processors.  If  one  MATLAB  process  fails  prior  to  the 
cornerturn,  the  remaining  processes  will  hang  while  waiting  for  a  message  that  will  never  arrive 
from  the  failed  process.  All  non-leader  MATLAB  processes  redirect  their  output  -  and  error 
messages  -  to  .out  files  in  the  MatMPi  directory.  If  the  error  occurs  on  a  non-leader  process, 
the  user  will  receive  no  notification  of  the  error.  This  is  a  common  occurrence  in  many  pMatlab 
programs.  Users  should  make  a  habit  of  routinely  checking  the  contents  of  the 
MatMPi/*  .  out  files  when  they  suspect  their  programs  have  hung. 

Improving  performance  via  parallelism  is  no  excuse  for  inefficient  code.  Parallel  programs  can 
also  benefit  from  profiling  to  identify  bottlenecks  in  both  serial  and  parallel  sections. 
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8.7.3.  Debugging 

The  MATLAB  debugger  can  be  used  to  assist  in  debugging  pMatlab  programs  in  a  limited 
fashion.  The  debugger  runs  on  only  the  leader  process  on  the  user’s  machine.  Nevertheless, 
even  this  limited  debugging  capability  can  provide  valuable  information  to  the  user. 

Since  pMatlab  is  SPMD,  every  processor  should  execute  every  line  of  code.  However,  since  the 
debugger  only  runs  on  the  leader  process,  breakpoints  do  not  affect  non-leader  processes.  Non¬ 
leader  processes  will  simply  continue  past  breakpoints  set  by  the  user  until  they  reach  a 
parallelized  operation  which  requires  communication  with  the  leader.  However,  this 
asynchronous  behavior  will  not  affect  the  behavior  of  the  program  or  the  debugger. 

8.7.4.  Scaling 

Another  tenet  of  good  software  engineering  is  that  programs  should  not  be  run  on  full  scale 
inputs  immediately.  Rather,  programs  should  initially  be  run  on  a  small  test  problem  to  verify 
functionality  and  to  validate  against  known  results.  The  programs  should  be  scaled  to  larger  and 
more  complicated  problems  until  the  program  is  fully  validated  and  ready  to  be  run  at  full  scale. 
The  same  is  true  for  parallel  programming.  Parallel  programs  should  not  run  at  full  scale  on  32 
processors  as  soon  as  the  programmer  has  finished  taking  a  first  stab  at  writing  the  application. 
Both  the  test  input  and  number  of  processors  should  be  gradually  scaled  up. 

The  following  is  the  recommended  procedure  for  scaling  up  a  pMatlab  application.  This 
procedure  gradually  adds  complexity  to  running  the  application. 

1.  Run  with  1  processor  on  the  user’s  local  machine  with  the  pMatlab  library  disabled. 

This  tests  the  basic  serial  functionality  of  the  code. 

2.  Run  with  1  processor  on  the  local  machine  with  pMatlab  enabled.  Tests  that  the 
pMatlab  library  has  not  broken  the  basic  functionality  of  the  code. 

3.  Run  with  2  processors  on  the  local  machine.  Tests  the  program’s  functionality  works 
with  more  than  one  processor  without  network  communication. 

4.  Run  with  2  processors  on  multiple  machines.  Test  that  the  program  works  with 
network  communication. 

5.  Run  with  4  processors  on  multiple  machines. 

6.  Increase  the  number  of  processors,  as  desired. 

Figure  26  shows  the  sequence  of  parameters  that  should  be  used  to  scaling  pMatlab  applications. 
See  Figure  22  and  Figure  23  for  examples  on  how  to  set  these  parameters. 


In  pMatlab  code 

In  RUN  .  m 

1. 

PARALLEL  =  0; 

Ncpus  =  1; 

cpus  =  { } ; 

2. 

PARALLEL  =  1; 

Ncpus  =  1; 

cpus  =  { } ; 

3. 

PARALLEL  =  1; 

Ncpus  =  2; 

cpus  =  { } ; 

4. 

PARALLEL  =  1; 

Ncpus  =  2; 

cpus  =  { ' nodel ' , ' node2 ' } ; 

5. 

PARALLEL  =  1; 

Ncpus  =  4; 

cpus  =  { ' nodel ' , ' node2 ' } ; 

6. 

PARALLEL  =  1; 

cpus  =  { ' nodel ' , ' node2 ' } ; 

Figure  26  -  Example  sequence  of  parameters  for  scaling  parallel  programs. 


MATLAB  ®  is  a  registered  trademark  of  The  Mathworks,  Inc. 


26 


9.  Measuring  Performance 

This  section  covers  some  of  the  basic  methods  to  measure  and  compute  performance  of  parallel 
programs. 

9.1.  Speedup 

The  most  common  measure  of  performance  for  parallel  programs  is  speedup.  For  input  size  n 
and  number  of  processors  p,  speedup  is  the  ratio  between  the  serial  and  parallel  runtimes: 


speedup(n,p ) 


Hme paralleling) 


Most  speedup  curves  can  be  places  into  the  following  categories: 

•  Linear  -  Occurs  when  the  application  runs  p  times  faster  on  p  processors  than  on  a  single 
processor.  In  general,  linear  (also  known  as  ideal )  is  the  theoretical  limit. 

•  Superlinear  -  In  rare  cases,  a  parallel  program  can  achieve  speedup  greater  than  linear. 
This  can  occur  if  the  subproblems  fit  into  cache  or  main  memory,  significantly  reducing 
data  access  times. 

•  Sublinear  -  Occurs  when  adding  more  processors  improves  performance,  but  not  as  fast 
as  linear.  This  is  a  common  speedup  observed  in  many  parallel  programs. 

•  Saturation  -  Occurs  when  communication  time  grows  faster  than  computation  time 
decreases.  Often  there  is  a  “sweet  spot”  when  performance  peaks.  This  is  also  a 
common  speedup  observed  in  parallel  programs. 

If  no  communication  is  required,  then  speedup  should  be  linear  with  N.  These  types  of 
applications  are  known  as  embarrassingly  parallel.  If  communication  is  required,  then  the  non¬ 
communicating  portion  of  the  code  should  speedup  linearly  with  N.  Figure  27  shows  examples 
of  each  of  the  different  types  of  speedups. 


Figure  27  -  Examples  of  typical  speedup  curves. 
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9.2.  Amdahl’s  Law 

Amdahl’s  Law  is  a  useful  principle  for  determining  the  theoretical  maximum  speedup  a  parallel 
program  can  achieve.  As  stated  in  [11],  “Amdahl’s  Law  states  that  the  performance 
improvement  to  be  gained  from  using  some  faster  mode  of  execution  is  limited  by  the  fraction  of 
the  time  the  faster  mode  can  be  used.”  In  our  case,  the  “faster  mode  of  execution”  is 
parallelizing  one  or  more  sections  of  code.  Amdahl’s  Law  can  be  concisely  described  using  the 
following  equation,  where  /is  the  fraction  of  time  spent  on  serial  operations  and  p  is  the  number 
of  processors: 


speedup  <  — — ,  1  v 

From  this  equation,  we  see  that  as  the  number  of  processors  increases,  the  term  (l  -  f)/ p 
approaches  0,  resulting  in  the  following: 

lim  speedup  <  — 

f 

As  we  can  see,  the  maximum  possible  speedup  for  a  parallel  program  depends  on  how  much  of  a 
program  can  be  parallelized.  Consider  three  programs  with  three  different  values  off:  0.1,  0.01, 
and  0.001.  The  maximum  possible  speedups  for  these  programs  are  10,  100,  and  100, 
respectively.  Figure  28  shows  the  speedups  for  these  programs  up  to  1024  processors. 

Amdahl’s  Law  clearly  demonstrates  the  law  of  diminishing  returns:  as  the  number  of  processors 
increases,  the  amount  of  speedup  attained  by  adding  more  processors  decreases. 


Theoretical  Max  Speedup 


Processors 


-  Linear 
f  =  0.001 
f  =  0.01 
f  =  0.1 


Figure  28  -  Theoretical  maximum  speedups  for  three  parallel  programs  with  different  values  of  f. 
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One  common  misconception  about  speedup  is  the  amount  of  benefit  adding  processors  provides. 
With  respect  to  a  single  processor,  16  processors  provide  a  maximum  of  16x  speedup  and  32 
processors  provides  a  maximum  of  32x  speedup.  Obvious,  right?  What  many  parallel 
programmers  miss  is  that  while  16  processors  provide  a  speedup  of  up  to  16x  over  a  single 
processor,  adding  16  more  processors  provides  an  additional  speedup  of  only  2x.  Keep  this  in 
mind  when  determining  how  many  resources  to  use  when  running  a  pMatlab  program. 

10.  Conclusion 

In  this  document,  we  have  introduced  the  following: 

•  Basic  concepts  in  parallel  programming 

•  Basic  programming  constructs  in  pMatlab 

•  How  to  launch,  write,  test,  debug,  and  scale  pMatlab  applications 

•  How  to  measure  performance  of  parallel  programs 

And  yet,  this  is  just  the  beginning.  Before  you  tackle  writing  your  own  pMatlab  application,  take 
a  moment  to  look  other  available  pMatlab  documentation.  For  example,  many  programs  fall  into 
a  class  of  applications  known  as  parameter  sweep  applications  in  which  a  set  of  code,  e.g.  a 
Monte  Carlo  simulation,  is  run  multiple  times  using  different  input  parameters.  [12]  introduces  a 
pMatlab  template  that  can  be  used  to  easily  parallelize  existing  parameter  sweep  applications. 
Additionally,  the  pMatlab  distribution  contains  an  examples  directory  which  contains  example 
programs  that  can  be  run  and  modified. 
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